song generation
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.92)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology (0.67)
SongCreator: Lyrics-based Universal Song Generation
Music is an integral part of human culture, embodying human intelligence and creativity, and songs form an essential part of it. While previous works have explored various aspects of song generation, such as singing voice, vocal composition, and instrumental arrangement, generating songs with both vocals and accompaniment given lyrics remains a significant challenge, hindering the real-world application of music generation models. In this light, we propose SongCreator, a song-generation system designed to tackle this challenge. The model features two novel designs: a meticulously designed dual-sequence language model (DSLM) that captures the information of both vocals and accompaniment for song generation, and a series of attention-mask strategies for the DSLM that allow our model to understand, generate, and edit songs, making it suitable for various song-related generation tasks by utilizing task-specific attention masks. Extensive experiments demonstrate the effectiveness of SongCreator, which achieves state-of-the-art or competitive performance on all eight tasks.
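To make the attention-mask idea concrete, below is a minimal sketch of how task-specific masks over a concatenated vocal/accompaniment token sequence might be built. The function name, flags, and exact mask layouts are illustrative assumptions; the paper defines its own strategies.

```python
import numpy as np

def dual_sequence_mask(n_vocal: int, n_accomp: int,
                       vocal_sees_accomp: bool = True,
                       accomp_sees_vocal: bool = False) -> np.ndarray:
    """Boolean mask over the concatenated [vocal; accomp] sequence.
    True = query position may attend to key position. Illustrative only."""
    n = n_vocal + n_accomp
    mask = np.zeros((n, n), dtype=bool)

    def fill(q0, qn, k0, kn, causal):
        for q in range(q0, qn):
            for k in range(k0, kn):
                mask[q, k] = (k - k0) <= (q - q0) if causal else True

    fill(0, n_vocal, 0, n_vocal, causal=True)   # vocals: causal self-attention
    fill(n_vocal, n, n_vocal, n, causal=True)   # accompaniment: causal self-attention
    if vocal_sees_accomp:                       # e.g. accompaniment-conditioned vocals
        fill(0, n_vocal, n_vocal, n, causal=False)
    if accomp_sees_vocal:                       # e.g. vocal-conditioned accompaniment
        fill(n_vocal, n, 0, n_vocal, causal=False)
    return mask

print(dual_sequence_mask(3, 3).astype(int))
```

Toggling the cross-track flags switches between task modes, e.g. generating vocals over a fully visible accompaniment prompt versus generating both tracks causally.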
LeVo: High-Quality Song Generation with Multi-Preference Alignment
Lei, Shun, Xu, Yaoxun, Lin, Zhiwei, Zhang, Huaicheng, Tan, Wei, Chen, Hangting, Yu, Jianwei, Zhang, Yixuan, Yang, Chenyu, Zhu, Haina, Wang, Shuai, Wu, Zhiyong, Yu, Dong
Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in audio quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, a language-model-based framework consisting of LeLM and a Music Codec. LeLM is capable of parallel modeling of two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve better vocal-instrument harmony, and dual-track tokens, which encode vocals and accompaniment separately for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between the token types. To further enhance musicality and instruction-following ability, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and post-training. Experimental results demonstrate that LeVo significantly outperforms existing open-source methods on both objective and subjective metrics, while performing competitively with industry systems. Ablation studies further justify the effectiveness of our designs. Audio examples and source code are available at https://levo-demo.github.io and https://github.com/tencent-ailab/songgeneration.
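As a rough illustration of the two-decoder layout described above (one stack for mixed tokens, one for dual-track tokens conditioned on it), here is a toy PyTorch sketch. Dimensions, wiring, and the conditioning-by-prefix choice are assumptions made for this sketch, not the released LeVo architecture.

```python
import torch
import torch.nn as nn

def causal_mask(n: int) -> torch.Tensor:
    # additive mask: -inf above the diagonal blocks attention to future positions
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

class ToyLeLMLayout(nn.Module):
    """Two decoder-only stacks: one models mixed tokens, the second models
    dual-track tokens conditioned on the mixed hidden states."""

    def __init__(self, vocab: int = 1024, d: int = 256, layers: int = 2):
        super().__init__()
        self.mixed_emb = nn.Embedding(vocab, d)   # positional encodings omitted
        self.dual_emb = nn.Embedding(vocab, d)
        make = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 4, 4 * d, batch_first=True), layers)
        self.mixed_dec, self.dual_dec = make(), make()
        self.mixed_head = nn.Linear(d, vocab)
        self.dual_head = nn.Linear(d, vocab)

    def forward(self, mixed_ids, dual_ids):
        h_mixed = self.mixed_dec(self.mixed_emb(mixed_ids),
                                 mask=causal_mask(mixed_ids.size(1)))
        # condition the dual-track stack by prefixing the mixed hidden states
        x = torch.cat([h_mixed, self.dual_emb(dual_ids)], dim=1)
        h = self.dual_dec(x, mask=causal_mask(x.size(1)))
        return self.mixed_head(h_mixed), self.dual_head(h[:, mixed_ids.size(1):])

model = ToyLeLMLayout()
mixed = torch.randint(0, 1024, (1, 8))
dual = torch.randint(0, 1024, (1, 8))
logits_mixed, logits_dual = model(mixed, dual)
print(logits_mixed.shape, logits_dual.shape)  # (1, 8, 1024) each
```

Keeping the two stacks separate mirrors the stated goal of preventing interference between the token types.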
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Leisure & Entertainment (1.00)
- Media > Music (0.89)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.92)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology (0.67)
SongPrep: A Preprocessing Framework and End-to-end Model for Full-song Structure Parsing and Lyrics Transcription
Tan, Wei, Lei, Shun, Zhang, Huaicheng, Li, Guangzheng, Zhang, Yixuan, Chen, Hangting, Yu, Jianwei, Gu, Rongzhi, Yu, Dong
Artificial Intelligence Generated Content (AIGC) is currently a popular research area, and among its various branches, song generation has attracted growing interest. Despite the abundance of available songs, effective data preparation remains a significant challenge: converting these songs into training-ready datasets typically requires extensive manual labeling, which is both time-consuming and costly. To address this issue, we propose SongPrep, an automated preprocessing pipeline designed specifically for song data. This framework streamlines key processes such as source separation, structure analysis, and lyric recognition, producing structured data that can be used directly to train song generation models. Furthermore, we introduce SongPrepE2E, an end-to-end structured lyrics recognition model based on pretrained language models. Without the need for additional source separation, SongPrepE2E can analyze the structure and lyrics of entire songs and provide precise timestamps. By leveraging context from the whole song alongside pretrained semantic knowledge, SongPrepE2E achieves a low Diarization Error Rate (DER) and Word Error Rate (WER) on the proposed SSLD-200 dataset. Downstream tasks demonstrate that training song generation models on the data produced by SongPrepE2E enables the generated songs to closely resemble those produced by humans.
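Since SongPrepE2E is evaluated by WER, a small self-contained implementation of that metric may help make the evaluation concrete. This is the generic word-level edit-distance formulation, not code from the paper:

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via Levenshtein distance over words."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                       # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(word_error_rate("we are the champions", "we are champions"))  # 0.25
```

DER is analogous but measured over time: the fraction of audio assigned to the wrong structural segment or speaker.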
- Media > Music (0.68)
- Leisure & Entertainment (0.68)
Versatile Framework for Song Generation with Prompt-based Control
Zhang, Yu, Guo, Wenxiang, Pan, Changhao, Zhu, Zhiyuan, Li, Ruiqi, Lu, Jingyu, Huang, Rongjie, Zhang, Ruiyuan, Hong, Zhiqing, Jiang, Ziyue, Zhao, Zhou
Song generation focuses on producing controllable, high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment, and they fall short in supporting various tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control. VersBand comprises the following primary models: 1) VocalBand, a decoupled model, leverages the flow-matching method to generate singing styles, pitches, and mel-spectrograms, allowing fast, high-quality vocal generation with style control. 2) AccompBand, a flow-based transformer model, incorporates Band-MOE, which selects suitable experts for enhanced quality, alignment, and control; this model generates controllable, high-quality accompaniments aligned with the vocals. 3) Two generation models, LyricBand for lyrics and MelodyBand for melodies, round out the comprehensive multi-task song generation system, allowing extensive control based on multiple prompts. Experimental results show that VersBand outperforms baseline models across multiple song generation tasks on objective and subjective metrics. Demos and code are available at https://aaronz345.github.io/VersBandDemo and https://github.com/AaronZ345/VersBand.
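VocalBand is described as using flow matching; the snippet below sketches the textbook conditional flow-matching training objective (linear interpolation path, constant target velocity). It is the generic recipe, not VersBand's code, and the stand-in network is purely for a smoke test.

```python
import torch

def flow_matching_loss(model, x1, cond=None):
    """One flow-matching training step: sample noise x0 and a time t,
    interpolate x_t, and regress the model onto the velocity x1 - x0."""
    x0 = torch.randn_like(x1)              # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, 1)       # uniform t in [0, 1], broadcastable
    x_t = (1 - t) * x0 + t * x1            # linear interpolation path
    target_v = x1 - x0                     # constant target velocity field
    return ((model(x_t, t, cond) - target_v) ** 2).mean()

net = lambda x, t, c: x                    # stand-in network for a smoke test
x1 = torch.randn(2, 80, 100)               # e.g. a batch of mel-spectrograms
print(flow_matching_loss(net, x1))
```

At inference, sampling integrates the learned velocity field from noise at t=0 to data at t=1, which is what enables the fast generation the abstract claims.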
- North America > Canada > Quebec > Montreal (0.04)
- Europe > United Kingdom > North Sea > Southern North Sea (0.04)
- Asia > China (0.04)
- North America > United States (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation
Zhang, Huaicheng, Tan, Wei, Li, Guangzheng, Zhang, Yixuan, Chen, Hangting, Lei, Shun, Yang, Chenyu, Wu, Zhiyong, Wang, Shuai, Huang, Qijun, Yu, Dong
Recent advances in audio-based generative language models have accelerated AI-driven lyric-to-song generation. However, these models frequently suffer from content hallucination, producing outputs misaligned with the input lyrics and undermining musical coherence. Current supervised fine-tuning (SFT) approaches, limited by passive label-fitting, exhibit constrained self-improvement and poor hallucination mitigation. To address this core challenge, we propose a novel reinforcement learning (RL) framework leveraging preference optimization for hallucination control. Our key contributions include: (1) Developing a robust hallucination preference dataset constructed via phoneme error rate (PER) computation and rule-based filtering to capture alignment with human expectations; (2) Implementing and evaluating three distinct preference optimization strategies within the RL framework: Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO). DPO operates off-policy to enhance positive token likelihood, achieving a significant 7.4% PER reduction. PPO and GRPO employ an on-policy approach, training a PER-based reward model to iteratively optimize sequences via reward maximization and KL-regularization, yielding PER reductions of 4.9% and 4.7%, respectively. Comprehensive objective and subjective evaluations confirm that our methods effectively suppress hallucinations while preserving musical quality. Crucially, this work presents a systematic, RL-based solution to hallucination control in lyric-to-song generation. The framework's transferability also unlocks potential for music style adherence and musicality enhancement, opening new avenues for future generative song research.
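Of the three strategies, DPO has the simplest training loop. Below is the standard DPO objective over (preferred, rejected) pairs, which in this setting would be, for example, low-PER versus high-PER renditions of the same lyrics. The formulation is the textbook one; beta is a placeholder hyperparameter, not the paper's value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Push the policy's preferred-sample likelihood above the rejected
    one, measured relative to a frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# sequence log-probs for a (low-PER, high-PER) pair under policy and reference
print(dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
               torch.tensor([-11.0]), torch.tensor([-11.5])))
```

PPO and GRPO instead sample on-policy and maximize a learned PER-based reward under a KL penalty, which matches the on-policy/off-policy split drawn in the abstract.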
- Asia > China > Guangdong Province > Shenzhen (0.05)
- Asia > China > Hubei Province > Wuhan (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Asia > China > Hong Kong (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Via Score to Performance: Efficient Human-Controllable Long Song Generation with Bar-Level Symbolic Notation
Wang, Tongxi, Yu, Yang, Wang, Qing, Qian, Junlang
Song generation is regarded as the most challenging problem in music AIGC; nonetheless, existing approaches have yet to fully overcome four persistent limitations: controllability, generalizability, perceptual quality, and duration. We argue that these shortcomings stem primarily from the prevailing paradigm of attempting to learn music theory directly from raw audio, a task that remains prohibitively difficult for current models. To address this, we present Bar-level AI Composing Helper (BACH), the first model explicitly designed for song generation through human-editable symbolic scores. BACH introduces a tokenization strategy and a symbolic generative procedure tailored to hierarchical song structure. Consequently, it achieves substantial gains in the efficiency, duration, and perceptual quality of song generation. Experiments demonstrate that BACH, with a small model size, establishes a new SOTA among all publicly reported song generation systems, even surpassing commercial solutions such as Suno. Human evaluations further confirm its superiority across multiple subjective metrics.
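To illustrate what bar-level symbolic tokenization can look like, here is a toy serialization of (onset, pitch, duration) notes into per-bar token runs. The vocabulary and layout are invented for illustration; BACH's actual tokenization strategy is defined in the paper.

```python
def tokenize_bars(notes, beats_per_bar=4.0):
    """Serialize (onset_beat, pitch, dur_beats) notes into bar-level tokens:
    a <bar> marker opens each bar, then position/pitch/duration tokens."""
    tokens, bar = [], -1
    for onset, pitch, dur in sorted(notes):
        while onset // beats_per_bar > bar:    # emit markers for any new bars
            bar += 1
            tokens.append("<bar>")
        tokens += [f"pos_{onset % beats_per_bar:g}",
                   f"pitch_{pitch}", f"dur_{dur:g}"]
    return tokens

# a C-major arpeggio spread over two 4/4 bars
print(tokenize_bars([(0, 60, 1), (1, 64, 1), (2, 67, 1), (4, 72, 2)]))
```

Because bars are explicit tokens, a model (or a human editor) can address and regenerate one bar at a time, which is the editability argument behind score-first generation.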
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Asia > Singapore (0.04)
- Asia > China (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment
Liu, Renhang, Hung, Chia-Yu, Majumder, Navonil, Gautreaux, Taylor, Bagherzadeh, Amir Ali, Li, Chuan, Herremans, Dorien, Poria, Soujanya
Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high-quality, faithful audio outputs capturing speech and acoustic events. However, there is still much room for improvement in creative audio generation, which primarily involves music and songs. Recent open lyrics-to-song models such as DiffRhythm, ACE-Step, and LeVo have set an acceptable standard in automatic song generation for recreational use. However, these models lack the fine-grained word-level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing song generation with word-level timing and duration control, allowing fine-grained vocal control. To better align generated songs with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need for manual data annotation. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset JAME. We show that JAM outperforms existing models in terms of music-specific attributes.
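One simple way to expose word-level timing to a generator is to expand word timestamps into a frame-aligned conditioning sequence, as sketched below. The frame rate and encoding are assumptions for illustration, not JAM's actual conditioning scheme.

```python
def word_timing_grid(words, total_sec, fps=50):
    """Expand (word, start_sec, end_sec) triples into a frame-aligned
    label sequence, one label per audio frame."""
    n = int(total_sec * fps)
    grid = ["<sil>"] * n                   # silence everywhere by default
    for word, start, end in words:
        for f in range(int(start * fps), min(int(end * fps), n)):
            grid[f] = word
    return grid

grid = word_timing_grid([("hello", 0.0, 0.4), ("world", 0.5, 1.0)], total_sec=1.2)
print(grid[18:27])  # tail of "hello", a silence gap, then "world"
```

A flow-matching decoder conditioned frame-by-frame on such a grid would inherit the word timing and duration directly from the input, which is the kind of control the abstract describes.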
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Europe > United Kingdom > North Sea > Southern North Sea (0.04)
- Asia > Singapore (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
Liu, Zihan, Ding, Shuangrui, Zhang, Zhixiong, Dong, Xiaoyi, Zhang, Pan, Zang, Yuhang, Cao, Yuhang, Lin, Dahua, Wang, Jiaqi
Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The generated samples are showcased on our project page at https://liuzh-19.github.io/SongGen/, and the code will be available at https://github.com/LiuZH-19/SongGen.
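The mixed versus dual-track distinction comes down to how two token streams are laid out for a single decoder. The toy function below shows two plausible layouts (per-frame interleaving and track concatenation); the pattern names are illustrative, and SongGen's actual token patterns are explored in the paper.

```python
def merge_tracks(vocal, accomp, pattern="frame"):
    """Merge two equal-length token streams for a single-stage decoder."""
    assert len(vocal) == len(accomp)
    if pattern == "frame":                 # v0 a0 v1 a1 ... stays time-aligned
        return [tok for pair in zip(vocal, accomp) for tok in pair]
    if pattern == "track":                 # all vocals, then all accompaniment
        return vocal + ["<sep>"] + accomp
    raise ValueError(f"unknown pattern: {pattern}")

print(merge_tracks(["v0", "v1"], ["a0", "a1"], "frame"))  # ['v0','a0','v1','a1']
print(merge_tracks(["v0", "v1"], ["a0", "a1"], "track"))  # ['v0','v1','<sep>','a0','a1']
```

Frame-level interleaving keeps the tracks synchronized at every step but doubles the sequence length; track concatenation keeps each track contiguous at the cost of long-range dependencies between them.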
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > China > Hong Kong (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Research Report (0.51)
- Workflow (0.47)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology (0.66)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)